docs: target Context7 benchmark gaps in Python skills [no-ci] #699
Merged
lukeocodes merged 2 commits into main on Apr 27, 2026
Conversation
The Context7 benchmark for /deepgram/deepgram-python-sdk scores the SDK against 10 standardized prompts (rubric: implementation 40 + accuracy 25 + relevance 20 + completeness 10 + clarity 5 = 100). Current score: 88.8. Four prompts had the largest gaps:

Prompt #1 (66/100) - Voice agent dynamic adjustment + stream restart

Eval said the skill 'lacks specific guidance or API methods for dynamically adjusting transcription parameters during an active connection or for intelligently managing stream restarts and pauses beyond basic error events'.

deepgram-python-voice-agent/SKILL.md:
- New 'Dynamic mid-session adjustment' section with runnable code for send_update_prompt, send_update_speak, send_update_think, send_inject_agent_message, send_inject_user_message, send_keep_alive (sync + async equivalents).
- New 'Stream lifecycle & recovery' section covering KeepAlive on idle, pause/resume audio, reconnect after disconnect with conversation history replay via AgentV1SettingsAgentContext, and EventType.CLOSE / EventType.ERROR handling guidance.

Prompt #2 (71/100) - Live streaming with interim results display

Eval said 'all examples show interim_results=False, which is the opposite of what's needed, and none demonstrate how to differentiate between interim and final results or how to handle the display logic'.

deepgram-python-speech-to-text/SKILL.md:
- Rewrote the WebSocket quick-start to pass interim_results=True, utterance_end_ms=1000, vad_events=True, with a real overwrite-line pattern that shows interim results live and commits the line on final.
- Added an 'Interim vs. final flag semantics' subsection explaining is_final, speech_final, and from_finalize distinctions and when each fires.

Prompt #5 (83/100) - Diarization + word-level timings combined

Eval said the skill 'lacks a specific, complete code example showing how to enable both diarization and word-level timings together in a single request'.

deepgram-python-audio-intelligence/SKILL.md:
- New 'Quick start - diarization with word-level timings' section: one focused snippet enabling diarize=True with per-word iteration showing speaker, start, end, confidence, punctuated_word.
- Added a per-word fields table (word, punctuated_word, start, end, confidence, speaker, speaker_confidence) plus a groupby-by-speaker pattern and pointers to utterances=True / paragraphs=True for pre-grouped views.

Prompt #8 (83/100) - Async URL transcription + retrieve final result

Eval said the skill 'lacks critical information about handling asynchronous results - while it mentions callback functionality, it doesn't explain how to retrieve the final transcription when using async methods or how to poll for results'.

deepgram-python-speech-to-text/SKILL.md:
- New 'Async / deferred result patterns' section explicitly distinguishing Python async/await (sync-style, immediate result via AsyncDeepgramClient) from deferred via callback URL (returns request_id immediately, results POST'd to webhook later, no polling).
- Decision table mapping each pattern to when to use it, with pointer to examples/12-transcription-prerecorded-callback.py.

Net: +276 lines targeting ~97 missing benchmark points (potential lift 88.8 -> ~98 once Context7 reindexes).
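The groupby-by-speaker pattern named under Prompt #5 can be sketched without the SDK. This is only an illustration, not the snippet from SKILL.md: it assumes the per-word results have already been converted to plain dicts carrying the field names listed above (`punctuated_word`, `start`, `end`, `speaker`), and both the sample data and the helper name `group_by_speaker` are hypothetical.

```python
from itertools import groupby

# Hypothetical per-word data using the field names from the per-word table.
words = [
    {"punctuated_word": "Hello,", "start": 0.00, "end": 0.40, "speaker": 0},
    {"punctuated_word": "there.", "start": 0.45, "end": 0.80, "speaker": 0},
    {"punctuated_word": "Hi!", "start": 1.10, "end": 1.35, "speaker": 1},
    {"punctuated_word": "Ready?", "start": 1.60, "end": 2.00, "speaker": 0},
]


def group_by_speaker(words):
    """Collapse consecutive same-speaker words into (speaker, start, end, text) turns."""
    turns = []
    # groupby only merges adjacent items, which is what we want for turns:
    # speaker 0 talking twice with speaker 1 in between yields three turns.
    for speaker, run in groupby(words, key=lambda w: w["speaker"]):
        run = list(run)
        text = " ".join(w["punctuated_word"] for w in run)
        turns.append((speaker, run[0]["start"], run[-1]["end"], text))
    return turns


for speaker, start, end, text in group_by_speaker(words):
    print(f"[{start:.2f}-{end:.2f}] speaker {speaker}: {text}")
```

Because `groupby` is applied to the words in time order without sorting, each turn keeps the start of its first word and the end of its last word.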
Pull request overview
This PR updates the in-repo Context7 “skills” documentation for the Deepgram Python SDK to address several benchmark prompt gaps, primarily by adding more complete, runnable examples and clarifying behavioral semantics (interim vs final streaming, mid-session agent updates, diarization + word timings, and async patterns).
Changes:
- Added mid-session voice agent control-message examples (prompt/think/speak updates, message injection, keep-alives) and reconnection/context replay guidance.
- Reworked live WebSocket transcription quick-start to demonstrate `interim_results=True` with clear interim-vs-final display handling and clarified result flags.
- Added a focused diarization + per-word timing quick-start and expanded async/deferred transcription guidance for prerecorded URL transcription.
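For the deferred (callback URL) pattern in the last bullet, the receiving side might look roughly like the sketch below. It is not taken from the PR. The payload shape (`metadata.request_id` and `results.channels[0].alternatives[0].transcript`) is an assumption to verify against a real callback, and `pending_jobs`, `remember_job`, and `handle_callback` are hypothetical names.

```python
import json

# request_id -> caller-supplied label, filled in when a job is submitted.
pending_jobs = {}


def remember_job(request_id, label):
    """Record the request_id that a callback-style request returns immediately."""
    pending_jobs[request_id] = label


def handle_callback(body):
    """Match a webhook POST body to its pending job and pull out the transcript."""
    payload = json.loads(body)
    # Assumed payload shape; confirm it against an actual callback delivery.
    request_id = payload["metadata"]["request_id"]
    transcript = payload["results"]["channels"][0]["alternatives"][0]["transcript"]
    label = pending_jobs.pop(request_id, None)  # None means an unknown request_id
    return label, transcript


# Simulated round trip with a hand-written payload (no network involved).
remember_job("req-123", "weekly-standup.wav")
sample_body = json.dumps(
    {
        "metadata": {"request_id": "req-123"},
        "results": {"channels": [{"alternatives": [{"transcript": "hello world"}]}]},
    }
).encode()
print(handle_callback(sample_body))
```

The point of the design is that nothing polls: the submitting code only stores the `request_id`, and the result is matched up when the webhook delivers it.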
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `.agents/skills/deepgram-python-voice-agent/SKILL.md` | Adds dynamic mid-session update examples and stream lifecycle/recovery guidance for Agent V1. |
| `.agents/skills/deepgram-python-speech-to-text/SKILL.md` | Updates live streaming quick-start for interim results and adds async/deferred result patterns + flag semantics. |
| `.agents/skills/deepgram-python-audio-intelligence/SKILL.md` | Adds a diarization + word-level timings quick-start and per-word field reference table. |
… feedback)

Both Copilot threads on PR #699:

- deepgram-python-speech-to-text/SKILL.md: the interim-results snippet used `global last_interim_len`, but the variable was defined in the enclosing `with` block, not at module scope. That would raise NameError on the first read. Replaced with a mutable closure (`state = {...}` dict), which is the idiomatic pattern when a callback needs to mutate state inside a `with` block.
- deepgram-python-voice-agent/SKILL.md: said the server emits a 'History event (type agent_v1history)'. `agent_v1history` is the internal Python module/file name, not the wire `type` literal. The wire `type` is `"History"` and the Python class is `AgentV1History`. Reworded so readers don't pattern-match on the wrong identifier.
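The mutable-closure fix described in the first thread can be illustrated in isolation. The sketch below is not the snippet from SKILL.md: the callback signature `(transcript, is_final)` and the factory name `make_printer` are hypothetical, and a real SDK callback would receive a message object instead. It only shows how a `state` dict lets a nested callback update the last interim length without `global`.

```python
import sys


def make_printer(out=sys.stdout):
    """Return a callback that overwrites interim lines and commits final ones."""
    # A dict is mutated in place, so the nested function needs neither
    # `global` nor `nonlocal` to update it.
    state = {"last_interim_len": 0}

    def on_result(transcript, is_final):
        # Pad with spaces so a shorter line fully covers a longer interim line.
        pad = " " * max(0, state["last_interim_len"] - len(transcript))
        if is_final:
            out.write(f"\r{transcript}{pad}\n")  # commit the line
            state["last_interim_len"] = 0
        else:
            out.write(f"\r{transcript}{pad}")  # stay on the same line
            state["last_interim_len"] = len(transcript)
        out.flush()

    return on_result


# Simulated stream: two interim updates followed by the final result.
on_result = make_printer()
on_result("hello wor", False)
on_result("hello world", False)
on_result("hello world.", True)
```

Keeping the state inside the factory also means two concurrent connections can each get their own printer without sharing a module-level variable.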
GregHolmes approved these changes Apr 27, 2026
GregHolmes pushed a commit that referenced this pull request May 6, 2026
🤖 I have created a release *beep* *boop*

---

## [7.1.0](v7.0.0...v7.1.0) (2026-05-06)

### Features

* update generated SDK models and restore agent settings compatibility ([#705](#705)) ([0b820c9](0b820c9))

### Documentation

* target Context7 benchmark gaps in Python skills [no-ci] ([#699](#699)) ([a232eb8](a232eb8))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Summary
Closes the four largest gaps in the Context7 benchmark for `/deepgram/deepgram-python-sdk`. Current score: 88.8/100 (mean across 10 standardized prompts). The 4 weakest prompts account for ~97 of the 112 missing points; this PR addresses each one specifically.

What's broken (Context7 evaluator quotes)

- Prompt #2 (71/100): "all examples show `interim_results=False`, which is the opposite of what's needed, and none demonstrate how to differentiate between interim and final results or how to handle the display logic"

Changes

`deepgram-python-voice-agent/SKILL.md` (+139 lines, prompt #1)

- `send_update_prompt(AgentV1UpdatePrompt)`: swap LLM system prompt mid-conversation
- `send_update_speak(AgentV1UpdateSpeak)`: swap TTS voice
- `send_update_think(AgentV1UpdateThink)`: swap LLM provider/model
- `send_inject_agent_message(...)`: force agent to say something
- `send_inject_user_message(...)`: inject user input
- `send_keep_alive(...)`: idle keep-alive
- `AgentV1SettingsAgentContext`, `EventType.CLOSE` / `EventType.ERROR` handling

`deepgram-python-speech-to-text/SKILL.md` (+103 lines, prompts #2 + #8)

Prompt #2:

- `interim_results=True`, `utterance_end_ms=1000`, `vad_events=True`
- `is_final`, `speech_final`, `from_finalize` distinctions

Prompt #8:

- Distinguishes Python `async`/`await` (sync-style, immediate result via `AsyncDeepgramClient`) from deferred via callback URL (returns `request_id` immediately, results POST'd to webhook later, no polling)
- Pointer to `examples/12-transcription-prerecorded-callback.py`

`deepgram-python-audio-intelligence/SKILL.md` (+41 lines, prompt #5)

- `diarize=True, smart_format=True, punctuate=True` + per-word iteration accessing `speaker`, `start`, `end`, `confidence`, `punctuated_word`
- `groupby`-by-speaker utterance pattern + pointer to `utterances=True` / `paragraphs=True` for pre-grouped views

Expected lift
If every gap closes:
Total potential: +77 / 1000 (across 10 prompts) = 88.8 → ~96.5 benchmark score.
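The figures quoted in this PR can be checked with a few lines of arithmetic. The sketch below uses only numbers stated above (the 88.8 mean, the four weak-prompt scores, and the +77 estimate). The variable names are illustrative, and the "ceiling" line is simply what the mean would be if all four prompts reached 100.

```python
mean_score = 88.8      # current benchmark score (mean over 10 prompts)
num_prompts = 10
weak_scores = {1: 66, 2: 71, 5: 83, 8: 83}  # prompt number -> score out of 100

# Points missing across all prompts, out of 10 * 100 = 1000.
total_missing = round((100 - mean_score) * num_prompts)
# Points missing on the four weakest prompts alone.
weak_missing = sum(100 - score for score in weak_scores.values())

# Projected mean if the estimated +77 points are recovered.
projected = round(mean_score + 77 / num_prompts, 1)
# Upper bound if all four weak prompts scored a full 100.
ceiling = round(mean_score + weak_missing / num_prompts, 1)

print(total_missing, weak_missing, projected, ceiling)  # 112 97 96.5 98.5
```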
After merge
Trigger Context7 refresh on `/deepgram/deepgram-python-sdk` to pull the new content into the index, then re-run the benchmark to verify the lift.